Lightweight Fingerprints for Fast Approximate Keyword Matching Using Bitwise Operations

نویسندگان

  • Aleksander Cislak
  • Szymon Grabowski
چکیده

We aim to speed up approximate keyword matching by storing a lightweight, fixed-size block of data for each string, called a fingerprint. These work in a similar way to hash values; however, they can be also used for matching with errors. They store information regarding symbol occurrences using individual bits, and they can be compared against each other with a constant number of bitwise operations. In this way, certain strings can be deduced to be at least within the distance k from each other (using Hamming or Levenshtein distance) without performing an explicit verification. We show experimentally that for a preprocessed collection of strings, fingerprints can provide substantial speedups for k = 1, namely over 2.5 times for the Hamming distance and over 10 times for the Levenshtein distance. Tests were conducted on synthetic and real-world English and URL data.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the availa...

متن کامل

Bitlist: New Full-text Index for Low Space Cost and Efficient Keyword Search

Nowadays Web search engines are experiencing significant performance challenges caused by a huge amount of Web pages and increasingly larger number of Web users. The key issue for addressing these challenges is to design a compact structure which can index Web documents with low space and meanwhile process keyword search very fast. Unfortunately, the current solutions typically separate the spa...

متن کامل

Fast Approximate Matching Algorithm for Phone-based Keyword Spotting

Generally, exact matching is widely used for keyword spotting (KWS). Its performance depends heavily on the recognition accuracy. As for phone-based KWS system, the influence of phoneme error rate (PER) on KWS increases as the length of phoneme sequence for the keyword grows. Approximate matching is an alteration to compensate errors in recognition. Compared to exact matching, the calculation c...

متن کامل

File Detection in Network Traffic Using Approximate Matching

Virtually every day data breach incidents are reported in the news. Scammers, fraudsters, hackers and malicious insiders are raking in millions with sensitive business and personal information. Not all incidents involve cunning and astute hackers. The involvement of insiders is ever increasing. Data information leakage is a critical issue for many companies, especially nowadays where every empl...

متن کامل

A Unified View to String Matching Algorithms

We present a uniied view to sequential algorithms for many pattern matching problems, using a nite automaton built from the pattern which uses the text as input. We show the limitations of deterministic nite automata (DFA) and the advantages of using a bitwise simulation of non-deterministic nite automata (NFA). This approach gives very fast practical algorithms which have good complexity for s...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • CoRR

دوره abs/1711.08475  شماره 

صفحات  -

تاریخ انتشار 2017